Abstract


We develop a theoretical analysis of the generalization performance of regularized least-squares on reproducing kernel Hilbert spaces for supervised learning. We show that the effective dimension of an integral operator plays a central role in the definition of a criterion for choosing the regularization parameter as a function of the number of samples. A minimax analysis shows that this criterion is asymptotically optimal.
1  Introduction


In this work we investigate the choice of the regularization parameter for the regularized least squares algorithm (RLS) on a reproducing kernel Hilbert space (RKHS) in the regression setting. This problem has already been extensively studied in the statistical learning literature. Probabilistic upper bounds on the excess risk of the empirical estimators are known and usually involve suitable priors on the regression function. In particular, in [11] optimal rates are established assuming some smoothness condition on the regression function. In [3] a covering number technique is used to obtain explicit bounds expressed in terms of suitable complexity measures of the regression function. In [5], [20] the covering techniques are replaced by estimates of integral operators through concentration inequalities for vector valued random variables. Although expressed in terms of easily computable quantities, these last bounds carry almost no information about the actual structure of the kernel in use. We show that such information can be exploited to obtain tighter bounds. The approach we consider is a refinement of the functional analytical techniques presented in [5]. The central concept in this development is the effective dimension which, roughly speaking, counts the number of degrees of freedom associated with the kernel and the marginal probability measure for a given condition number. The effective dimension was recently used in [26] and [13] in the analysis of the performance of kernel method estimators. In this work we show that the effective dimension plays a role in the definition of an effective rule for the choice of the regularization parameter, and we prove that this rule is optimal, in a minimax sense, for the problems induced by a certain family of priors.


Since the effective dimension depends on both the kernel and the marginal probability distribution on the input space, our choice of the regularization parameter is not completely distribution independent. In fact the spectrum of the integral operator depends strongly on the marginal distribution but not on the dimension of the ambient space. These considerations raise the question of whether the effective dimension could be estimated from unlabelled data.


The work is organized as follows. In section 2 we recall very briefly the main concepts of statistical learning theory in the classical setting [4], [9], [16]. In section 3 we fix the notation and define the mathematical objects which will be considered. Furthermore we prove some preliminary results on the structure of RLS estimators and on concentration of measure for vector valued random variables. In section 4 we prove the probabilistic upper bound for the excess risk of RLS estimators using the concept of effective dimension. In section 5 we give an explicit rule for the choice of the regularization parameter and compute the corresponding uniform rates of convergence in probability of the actual risk to its minimum. Finally, in section 6, using entropy estimates, we prove that the rates obtained in the previous section are indeed optimal for the relevant minimax problem.
2  Learning  from  examples


We  briefly  recall  some  basic  concepts  of  statistical  learning  theory  in  the  regression 
setting  (for  details  see  [23],  [9],  [17],  [2]  and  references  therein). 

In the framework of learning from examples there are two sets of variables: the input space $X$ and the output space $Y \subset \mathbb{R}$. The relation between the input $x \in X$ and the output $y \in Y$ is described by a probability distribution $\rho(x,y) = \nu(x)\rho(y|x)$ on $X \times Y$, where $\nu$ is the marginal distribution on $X$ and $\rho(\cdot|x)$ is the conditional distribution of $y$ given $x \in X$. The distribution $\rho$ is known only through a sample $\mathbf{z} = (\mathbf{x},\mathbf{y}) = ((x_1,y_1),\dots,(x_\ell,y_\ell))$, called the training set, drawn independently and identically distributed (i.i.d.) according to $\rho$. Given the sample $\mathbf{z}$, the aim of learning theory is to find a function $f_{\mathbf z} : X \to \mathbb{R}$ such that $f_{\mathbf z}(x)$ is a good estimate of the output $y$ when a new input $x$ is given. The function $f_{\mathbf z}$ is called the estimator, and the map providing $f_{\mathbf z}$ for any training set $\mathbf z$ is called the learning algorithm.

Given a function $f : X \to \mathbb{R}$, the ability of $f$ to describe the distribution $\rho$ is measured by its expected risk, defined as

$$I[f] = \int_{X\times Y} (f(x) - y)^2 \, d\rho(x,y),$$

and the regression function

$$f_\rho(x) = \int_Y y \, d\rho(y|x)$$

is the minimizer of the expected risk over the space of all measurable real functions on $X$. In this sense $f_\rho$ can be seen as the ideal estimator for the probability distribution $\rho$. However, the regression function cannot be reconstructed exactly, since only a finite, possibly small, set of examples $\mathbf z$ is given.


To overcome this problem, in the framework of the regularized least squares algorithm [24], [15], [2], [27], a Hilbert space $\mathcal H$ of real functions on $X$ is fixed and the estimator $f_{\mathbf z}^\lambda$ is defined as the solution of the regularized least squares problem

$$\min_{f\in\mathcal H}\left\{\frac{1}{\ell}\sum_{i=1}^{\ell}(f(x_i) - y_i)^2 + \lambda\|f\|_{\mathcal H}^2\right\}, \tag{1}$$

where $\lambda$ is a positive parameter to be chosen so that the discrepancy

$$I[f_{\mathbf z}^\lambda] - \inf_{f\in\mathcal H} I[f]$$

is small with high probability. Since $\rho$ is unknown, the above difference is studied by means of a probabilistic bound $B(\lambda,\ell,\eta)$, which is a function of the regularization parameter $\lambda$, the number $\ell$ of examples and the confidence level $1-\eta$, such that

$$P\left[I[f_{\mathbf z}^\lambda] - \inf_{f\in\mathcal H} I[f] \le B(\lambda,\ell,\eta)\right] \ge 1 - \eta.$$
In particular, the learning algorithm is consistent if it is possible to choose the regularization parameter as a function of the available data, $\lambda = \lambda(\ell,\mathbf z)$, in such a way that

$$\lim_{\ell\to+\infty} P\left[I[f_{\mathbf z}^{\lambda(\ell,\mathbf z)}] - \inf_{f\in\mathcal H} I[f] \ge \epsilon\right] = 0 \tag{2}$$

for every $\epsilon > 0$. The above convergence in probability is usually called (weak) consistency of the algorithm (see [7] for a discussion of different types of consistency).


3  Notations,  assumptions  and  preliminary  results 


We assume that the input space $X$ is a separable metric space and the output space $Y$ is a closed subset of $\mathbb{R}$. We let $Z$ be the product space $X \times Y$, which is a separable metric space.

We denote by $\nu$ the marginal distribution on $X$ and by $\rho(y|x)$ the conditional distribution of $y \in Y$ given $x \in X$. Let $L^2(Z,\rho,Y)$ be the Hilbert space of square integrable functions on $Z$ with respect to $\rho$; we denote by $\|\cdot\|_\rho$ and $\langle\cdot,\cdot\rangle_\rho$ the corresponding norm and scalar product. We use similar notation for $L^2(X,\rho_X,Y)$, where $\rho_X$ is another name for the marginal $\nu$. Moreover we assume that $\nu$ is not degenerate, i.e. all the non-empty open subsets of $X$ have strictly positive measure.

We assume that the space $\mathcal H$ is a reproducing kernel Hilbert space [1], [18], with a separately continuous kernel $K : X \times X \to \mathbb{R}$ such that

$$\kappa = \sup_{x\in X} K(x,x) < +\infty. \tag{3}$$

As usual $\langle\cdot,\cdot\rangle_{\mathcal H}$ and $\|\cdot\|_{\mathcal H}$ denote the corresponding scalar product and norm. Since $K$ is a separately continuous bounded kernel and $X$ is separable, $\mathcal H$ is a real separable Hilbert space whose elements are real continuous functions defined on $X$ [18]. Moreover, given $x \in X$ the function $K_x = K(\cdot,x)$ belongs to $\mathcal H$ and the following reproducing property holds

$$f(x) = \langle f, K_x\rangle_{\mathcal H} \qquad \forall f \in \mathcal H, \tag{4}$$

and (3) ensures

$$\|K_x\|_{\mathcal H} = \sqrt{K(x,x)} \le \sqrt{\kappa}. \tag{5}$$


We are now ready to state the hypotheses on the probability measure $\rho$. Firstly we assume the condition

$$\int_{X\times Y} y^2 \, d\rho(x,y) < +\infty. \tag{6}$$

Moreover we assume that some $f_{\mathcal H} \in \mathcal H$ exists which attains the infimum of the expected risk, that is

$$I[f_{\mathcal H}] = \inf_{f\in\mathcal H} I[f], \tag{7}$$

and that for some $M > 0$

$$|y - f_{\mathcal H}(x)|^2 \le M \qquad \rho\text{-a.s.} \tag{8}$$


Let us now review some properties of the operator $A : \mathcal H \to L^2(Z,\rho,Y)$ defined as follows

$$(Af)(x,y) = \langle f, K_x\rangle_{\mathcal H}.$$

Equation (4) implies that the action of $A$ on an element $f$ is simply

$$(Af)(x,y) = f(x) \qquad \forall x \in X,\ f \in \mathcal H,$$

that is, $A$ is the canonical inclusion of $\mathcal H$ into $L^2(Z,\rho,Y)$, where $y$ is a dummy variable. However, $A$ changes the norm, since $\|f\|_{\mathcal H}$ is different from $\|f\|_\rho$. The main properties of the operator $A$ are summarized in the following proposition.

Proposition 1 The operator $A$ is an injective Hilbert-Schmidt operator from $\mathcal H$ into $L^2(Z,\rho,Y)$ and

$$A^*\phi = \int_Z \phi(x,y) K_x \, d\rho(x,y), \tag{9}$$

$$T := A^*A = \int_X \langle\cdot, K_x\rangle_{\mathcal H} K_x \, d\nu(x), \tag{10}$$

where $\phi \in L^2(Z,\rho,Y)$, the first integral converges in norm and the second one in trace norm. In particular $T$ is a trace class injective operator from $\mathcal H$ to $\mathcal H$.


PROOF. The proof is standard and we report it for completeness.

Since the elements $f \in \mathcal H$ are continuous functions, (5) bounds them by

$$|f(x)| = |\langle f, K_x\rangle_{\mathcal H}| \le \|f\|_{\mathcal H}\|K_x\|_{\mathcal H} \le \sqrt{\kappa}\,\|f\|_{\mathcal H}.$$

Since $\rho$ is a probability measure, $f \in L^2(Z,\rho,Y)$ and $A$ is a linear operator from $\mathcal H$ to $L^2(Z,\rho,Y)$ with $\|Af\|_\rho \le \sqrt{\kappa}\,\|f\|_{\mathcal H}$, so that $A$ is bounded.

We now show that $A$ is injective. Let $f \in \mathcal H$ and $W = \{x \in X \mid f(x) \ne 0\}$, which is open since $f$ is continuous. Assume now that $Af = 0$, i.e., $f(x) = 0$ for $\nu$-almost all $x \in X$; then $W$ has null measure. Since $\nu$ is not degenerate, $W$ is the empty set, i.e., $f(x) = 0$ for all $x \in X$, so that $f = 0$.

We now prove (9). Since $K$ is separately continuous, the map

$$X \ni x \mapsto K_x \in \mathcal H$$

is weakly continuous and, since $\mathcal H$ is separable, is strongly measurable. Hence, given $\phi \in L^2(Z,\rho,Y)$, the map $(x,y) \mapsto \phi(x,y)K_x$ is measurable from $Z$ to $\mathcal H$. Moreover, (5) gives

$$\|\phi(x,y)K_x\|_{\mathcal H} \le |\phi(x,y)|\sqrt{\kappa}$$

for all $x \in X$. Since $\rho$ is finite, $\phi$ is in $L^1(Z,\rho)$ and, hence, $(x,y) \mapsto \phi(x,y)K_x$ is integrable as a vector valued map. Finally, for all $f \in \mathcal H$,

$$\int_Z \phi(x,y)\langle K_x, f\rangle_{\mathcal H} \, d\rho(x,y) = \langle\phi, Af\rangle_\rho = \langle A^*\phi, f\rangle_{\mathcal H},$$

so (9) holds.

Equation (10) is a consequence of (9), the fact that the integral commutes with the scalar product, and the definition of the marginal distribution $\nu$.

We now prove that $A$ is a Hilbert-Schmidt operator. Let $(e_n)_{n\in\mathbb N}$ be a Hilbert basis of $\mathcal H$, which is separable. Since $A^*A$ is a positive operator and $|\langle K_x, e_n\rangle_{\mathcal H}|^2$ is a positive function, by the monotone convergence theorem we have that

$$\mathrm{Tr}(A^*A) = \sum_n \int_X |\langle e_n, K_x\rangle_{\mathcal H}|^2 \, d\nu(x) = \int_X \sum_n |\langle e_n, K_x\rangle_{\mathcal H}|^2 \, d\nu(x) = \int_X K(x,x) \, d\nu(x) \le \kappa,$$

and the thesis follows. The properties of $T$ are an easy consequence of the corresponding properties of $A$. □


The following proposition clarifies the role of the operator $A$ in the context of learning theory. The result is well known in the framework of linear inverse problems, see for example [8]. With a slight abuse of notation we denote by $y$ both the variable and the function $(x,y) \mapsto y$, which belongs to $L^2(Z,\rho,Y)$ due to (6).


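Proposition 2 If a minimizer $f_{\mathcal H}$ of the expected risk $I[f]$ exists on $\mathcal H$, then it is unique, satisfies

$$T f_{\mathcal H} = A^* y, \tag{11}$$

and

$$I[f] - I[f_{\mathcal H}] = \|A(f - f_{\mathcal H})\|_\rho^2 = \|\sqrt{T}(f - f_{\mathcal H})\|_{\mathcal H}^2 \tag{12}$$

for all $f \in \mathcal H$.

If $\lambda > 0$ a unique minimizer $f^\lambda$ of the regularized expected risk

$$I[f] + \lambda\|f\|_{\mathcal H}^2$$

exists and it is given by

$$f^\lambda = (T + \lambda)^{-1}A^* y = (T + \lambda)^{-1}T f_{\mathcal H}. \tag{13}$$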
PROOF. Clearly

$$I[f] = \|Af - y\|_\rho^2$$

for all $f \in \mathcal H$. Since $f_{\mathcal H}$ is a minimizer, by differentiation we obtain

$$\langle Af, Af_{\mathcal H} - y\rangle_\rho = 0 \qquad \forall f \in \mathcal H \tag{14}$$

and (11) follows. The uniqueness is ensured by the injectivity of $T$.

Given $f \in \mathcal H$,

$$I[f] - I[f_{\mathcal H}] = \|Af - y\|_\rho^2 - \|Af_{\mathcal H} - y\|_\rho^2 = \|A(f - f_{\mathcal H})\|_\rho^2 + 2\langle A(f - f_{\mathcal H}), Af_{\mathcal H} - y\rangle_\rho = \|A(f - f_{\mathcal H})\|_\rho^2,$$

since the second term is zero due to (14). Let $A = U\sqrt{T}$ be the polar decomposition. Since $U$ is a partial isometry from the closure of the range of $\sqrt{T}$ onto the closure of the range of $A$,

$$\|A(f - f_{\mathcal H})\|_\rho = \|\sqrt{T}(f - f_{\mathcal H})\|_{\mathcal H}.$$

Finally, (13) follows by setting the derivative of the regularized expected risk equal to zero. □
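Proposition 2 can be checked numerically in a finite-dimensional toy case. The sketch below is only an illustration under stated assumptions: the distribution is a hypothetical uniform measure on 50 sampled points, the kernel is linear so that $\mathcal H$ is identified with $\mathbb{R}^d$, and invertibility of $T$ stands in for the injectivity used in the proof. None of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite distribution: n equally likely points (x_i, y_i) in R^d x R.
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Linear kernel K(x, x') = <x, x'>: H is identified with R^d and f(x) = <w, x>.
T = X.T @ X / n          # T = A*A, here the second-moment matrix E[x x^T]
Asy = X.T @ y / n        # A*y = E[y x]

def risk(w):
    """Expected risk I[f] = E[(f(x) - y)^2] under the finite distribution."""
    return np.mean((X @ w - y) ** 2)

w_H = np.linalg.solve(T, Asy)                        # equation (11): T f_H = A*y
lam = 0.1
w_lam = np.linalg.solve(T + lam * np.eye(d), Asy)    # equation (13)

w = rng.normal(size=d)                               # an arbitrary f in H
excess = risk(w) - risk(w_H)                         # left-hand side of (12)
quad = (w - w_H) @ T @ (w - w_H)                     # ||sqrt(T)(f - f_H)||_H^2
```

Here `excess` and `quad` agree up to rounding, which is identity (12), and `w_lam` coincides with `(T + lam)^{-1} T w_H`, the second form in (13).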


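Let now $\mathbf z = (\mathbf x,\mathbf y) = ((x_1,y_1),\dots,(x_\ell,y_\ell)) \in Z^\ell$ be the training set. The above arguments can be repeated replacing the measure $\rho$ with the empirical measure $\rho_{\mathbf z} = \frac{1}{\ell}\sum_{i=1}^{\ell}\delta_{(x_i,y_i)}$, where $\delta_{(x,y)}$ is the Dirac measure at the point $(x,y) \in Z$. An element $\phi \in L^2(Z,\rho_{\mathbf z})$ is completely specified by the vector $w \in \mathbb{R}^\ell$ given by

$$w_i = \phi(x_i,y_i)$$

with the condition that $w_i = w_j$ whenever $(x_i,y_i) = (x_j,y_j)$. In the following we represent the elements of $L^2(Z,\rho_{\mathbf z})$ as vectors in $\mathbb{R}^\ell$, where the scalar product is given by

$$\langle w, w'\rangle_{L^2(Z,\rho_{\mathbf z})} = \frac{1}{\ell}\sum_{i=1}^{\ell} w_i w'_i.$$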

We now get a discretized version of $A$ by defining the sampling operator [19]

$$A_{\mathbf z} : \mathcal H \to L^2(Z,\rho_{\mathbf z}), \qquad (A_{\mathbf z}f)_i = \langle f, K_{x_i}\rangle_{\mathcal H} = f(x_i) \qquad \forall i = 1,\dots,\ell.$$

The main properties of the sampling operator are given by the following proposition.

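Proposition 3 The sampling operator $A_{\mathbf z} : \mathcal H \to L^2(Z,\rho_{\mathbf z})$ is a finite rank operator and

$$A_{\mathbf z}^* w = \frac{1}{\ell}\sum_{i=1}^{\ell} w_i K_{x_i}, \qquad w \in L^2(Z,\rho_{\mathbf z}), \tag{15}$$

$$T_{\mathbf x} := A_{\mathbf z}^* A_{\mathbf z} = \frac{1}{\ell}\sum_{i=1}^{\ell}\langle\cdot, K_{x_i}\rangle_{\mathcal H} K_{x_i}. \tag{16}$$

If $\lambda > 0$ a unique minimizer $f_{\mathbf z}^\lambda$ of the regularized empirical error

$$\frac{1}{\ell}\sum_{i=1}^{\ell}(f(x_i) - y_i)^2 + \lambda\|f\|_{\mathcal H}^2$$

exists and is given by

$$f_{\mathbf z}^\lambda = (T_{\mathbf x} + \lambda)^{-1}A_{\mathbf z}^* \mathbf y. \tag{17}$$

PROOF. The content of the proposition is a restatement of Propositions 1 and 2 and the fact that the integrals reduce to sums. □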
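Formula (17) can be turned into a computation by looking for the estimator in the form $f = \sum_i c_i K_{x_i}$; substituting into $(T_{\mathbf x} + \lambda)f = A_{\mathbf z}^*\mathbf y$ gives the linear system $(\mathbf K + \ell\lambda I)c = \mathbf y$, with $\mathbf K_{ij} = K(x_i,x_j)$. The following is a minimal sketch of that computation. The Gaussian kernel, the synthetic data and the function names are assumptions made for illustration and do not appear in the text.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """K(x, x') = exp(-|x - x'|^2 / (2 width^2)); bounded, with sup K(x, x) = 1."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * width**2))

def rls_fit(X, y, lam, kernel=gaussian_kernel):
    """Coefficients c with f = sum_i c_i K_{x_i} solving (T_x + lam) f = A_z^* y."""
    ell = len(y)
    K = kernel(X, X)
    return np.linalg.solve(K + ell * lam * np.eye(ell), y)

def rls_predict(X_train, c, X_new, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_i c_i K(x, x_i) at the rows of X_new."""
    return kernel(X_new, X_train) @ c

# Hypothetical one-dimensional regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=40)
lam = 1e-2
c = rls_fit(X, y, lam)
```

Solving one $\ell \times \ell$ system is the simplest choice; for large $\ell$ a Cholesky factorization or an iterative solver would be the usual alternative.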


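Finally we need the following probabilistic inequality due to Pinelis and Sakhanenko [14], [25].

Proposition 4 Let $(\Omega,\mathcal F,P)$ be a probability space and $\xi$ be a random variable on $\Omega$ taking values in a real separable Hilbert space $\mathcal H$. Assume that there are two positive constants $H$ and $\sigma$ such that

$$\|\xi(\omega)\|_{\mathcal H} \le \frac{H}{2} \quad \text{a.s.}, \tag{18}$$

$$E[\|\xi\|_{\mathcal H}^2] \le \sigma^2. \tag{19}$$

Let $\ell \in \mathbb N$ and $0 < \eta < 1$, then

$$P\left[(\omega_1,\dots,\omega_\ell) \in \Omega^\ell \ \Big|\ \left\|\frac{1}{\ell}\sum_{i=1}^{\ell}\xi(\omega_i) - E[\xi]\right\|_{\mathcal H} \le 2\left(\frac{H}{\ell} + \frac{\sigma}{\sqrt{\ell}}\right)\log\frac{2}{\eta}\right] \ge 1 - \eta. \tag{20}$$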


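PROOF. It is just a restatement of Th. 3.3.4 of [25], see also [21]. Consider the probability space $(\Omega^\ell,\mathcal F^{\otimes\ell},P^{\otimes\ell})$ and the set of independent random variables with zero mean $\xi_i(\omega_1,\dots,\omega_\ell) = \xi(\omega_i) - E[\xi]$ defined on $\Omega^\ell$. The fact that the $\xi_i$ are i.i.d. and conditions (18), (19) ensure that

$$\|\xi_i\|_{\mathcal H} \le H \quad \text{a.s.}, \qquad E[\|\xi_i\|_{\mathcal H}^2] \le \sigma^2,$$

so that, for all $m \ge 2$ it holds

$$\sum_{i=1}^{\ell} E[\|\xi_i\|_{\mathcal H}^m] \le \frac{1}{2} m!\, B^2 H^{m-2}$$

with $B^2 = \ell\sigma^2$. So Th. 3.3.4 of [25] can be applied and it ensures

$$P\left[\left\|\frac{1}{\ell}\sum_{i=1}^{\ell}(\xi(\omega_i) - E[\xi])\right\|_{\mathcal H} \ge \frac{xB}{\ell}\right] \le 2\exp\left(-\frac{x^2}{2(1 + xHB^{-1})}\right)$$

for all $x > 0$. Letting $\delta = \frac{xB}{\ell}$, we get the equation

$$\frac{1}{2}\left(\frac{\ell\delta}{B}\right)^2\frac{1}{1 + \ell\delta HB^{-2}} = \frac{\ell\delta^2\sigma^{-2}}{2(1 + \delta H\sigma^{-2})} = \log\frac{2}{\eta},$$

since $B^2 = \ell\sigma^2$. Defining $t = \delta H\sigma^{-2}$,

$$\frac{\ell\sigma^2}{2H^2}\,\frac{t^2}{1 + t} = \log\frac{2}{\eta}.$$

The inverse of the function $\frac{t^2}{1+t}$ is the function $g(t) = \frac{1}{2}\left(t + \sqrt{t^2 + 4t}\right)$, so

$$\left\|\frac{1}{\ell}\sum_{i=1}^{\ell}\xi(\omega_i) - E[\xi]\right\|_{\mathcal H} \le \frac{\sigma^2}{H}\, g\left(\frac{2H^2}{\ell\sigma^2}\log\frac{2}{\eta}\right)$$

with probability greater than $1 - \eta$. The thesis follows observing that $g(t) \le t + \sqrt{t}$ and $2\log\frac{2}{\eta} \ge \sqrt{2\log\frac{2}{\eta}} \ge 1$. □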
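A small Monte Carlo experiment gives a feel for the inequality of Proposition 4, read here as: with probability at least $1-\eta$, the empirical mean deviates from $E[\xi]$ by at most $2(H/\ell + \sigma/\sqrt{\ell})\log(2/\eta)$. The sketch below is an illustration under assumptions not taken from the text: $\xi$ is uniform on a sphere in $\mathbb{R}^d$, so that $E[\xi] = 0$, $\|\xi\| = H/2$ and $E\|\xi\|^2 = \sigma^2$ hold exactly.

```python
import numpy as np

def pinelis_bound(H, sigma, ell, eta):
    """Right-hand side of (20): 2 (H / ell + sigma / sqrt(ell)) log(2 / eta)."""
    return 2.0 * (H / ell + sigma / np.sqrt(ell)) * np.log(2.0 / eta)

rng = np.random.default_rng(0)
d, ell, eta, trials = 5, 200, 0.1, 2000
r = 1.0                        # xi is uniform on the sphere of radius r in R^d
H, sigma = 2.0 * r, r          # ||xi|| = r = H/2 and E||xi||^2 = r^2 = sigma^2

g = rng.normal(size=(trials, ell, d))
xi = r * g / np.linalg.norm(g, axis=2, keepdims=True)
deviation = np.linalg.norm(xi.mean(axis=1), axis=1)   # E[xi] = 0 by symmetry
violations = np.mean(deviation > pinelis_bound(H, sigma, ell, eta))
```

The fraction `violations` should not exceed `eta`. In this easy example the bound is far from tight, since the typical deviation is of order $r/\sqrt{\ell}$.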


4  Upper  bound 


The aim of this section is to give a probabilistic upper bound on the expected risk of the solution given by the regularized least squares algorithm. The bound depends on the number of examples $\ell$, the regularization parameter $\lambda$ and some prior information on the probability distribution $\rho$.

In the following, we assume that the space $\mathcal H$ and the probability distribution $\rho$ satisfy assumptions (3), (6), (7) and (8). Given a parameter $\lambda > 0$ we define

(1) the residual

$$\mathcal A(\lambda) = \|f^\lambda - f_{\mathcal H}\|_\rho^2 = \|\sqrt{T}(f^\lambda - f_{\mathcal H})\|_{\mathcal H}^2,$$

where $T$ is given by (10), $f^\lambda$ by (13) and $f_{\mathcal H}$ by (7);

(2) the reconstruction error

$$\mathcal B(\lambda) = \|f^\lambda - f_{\mathcal H}\|_{\mathcal H}^2;$$

(3) the effective dimension

$$\mathcal N(\lambda) = \mathrm{Tr}[(T + \lambda)^{-1}T],$$

where the trace is finite due to Proposition 1.

In the framework of learning $\mathcal A(\lambda)$ is called the approximation error, whereas in the framework of approximation theory $\sqrt{\mathcal B(\lambda)}$ is the approximation error. To avoid confusion we follow the notation of inverse problems.
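In terms of the eigenvalues $t_n$ of $T$, the effective dimension is $\mathcal N(\lambda) = \sum_n t_n/(t_n + \lambda)$, which makes it easy to compute once a spectrum is available. The sketch below shows this, together with a plug-in analogue built from a kernel matrix on inputs only. The plug-in version is an assumption added for illustration: it replaces $T$ by $T_{\mathbf x}$, using the fact that $T_{\mathbf x}$ and $\mathbf K/\ell$ share their nonzero eigenvalues, and it is not analyzed in the text. The function names are hypothetical.

```python
import numpy as np

def effective_dimension(eigs, lam):
    """N(lam) = Tr[(T + lam)^{-1} T] = sum_n t_n / (t_n + lam)."""
    eigs = np.asarray(eigs, dtype=float)
    return float(np.sum(eigs / (eigs + lam)))

def empirical_effective_dimension(K, lam):
    """Plug-in analogue Tr[(T_x + lam)^{-1} T_x] from an ell x ell kernel matrix K.

    T_x and K / ell share their nonzero eigenvalues, so only inputs are needed.
    """
    ell = K.shape[0]
    eigs = np.clip(np.linalg.eigvalsh(K) / ell, 0.0, None)
    return effective_dimension(eigs, lam)

# A toy spectrum: N(lam) grows towards the number of eigenvalues as lam -> 0.
eigs = np.array([0.5, 0.25, 0.125, 0.0625])
dims = [effective_dimension(eigs, lam) for lam in (1.0, 1e-2, 1e-6)]
```

Each eigenvalue contributes a weight between 0 and 1, close to 1 when $t_n \gg \lambda$ and close to 0 when $t_n \ll \lambda$, which is the sense in which $\mathcal N(\lambda)$ counts degrees of freedom at scale $\lambda$.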


We  are  now  ready  to  state  our  main  result  of  the  section. 


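Theorem 5 Let $\mathbf z \in Z^\ell$ be a training set drawn i.i.d. according to $\rho$ and $f_{\mathbf z}^\lambda \in \mathcal H$ the corresponding estimator given by (17). With probability greater than $1 - \eta$, $0 < \eta < 1$,

$$I[f_{\mathbf z}^\lambda] - I[f_{\mathcal H}] \le C_\eta\left(\mathcal A(\lambda) + \frac{\kappa^2\mathcal B(\lambda)}{\ell^2\lambda} + \frac{\kappa\mathcal A(\lambda)}{\ell\lambda} + \frac{\kappa M}{\ell^2\lambda} + \frac{M\mathcal N(\lambda)}{\ell}\right) \tag{21}$$

provided that

$$\ell\lambda \ge C_\eta\,\kappa\,\max\left(\mathcal N(\lambda), \sqrt{2/C_\eta}\right), \tag{22}$$

where $C_\eta = 128\log^2(8/\eta)$.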


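PROOF. We split the proof in several steps. Here $\|\cdot\|_{\mathcal L(\mathcal H)}$ denotes the uniform norm of an operator from $\mathcal H$ to $\mathcal H$. Let $\lambda$, $\eta$ and $\ell$ be as in the statement of the theorem.

Step 1: Given a training set $\mathbf z = (\mathbf x,\mathbf y) \in Z^\ell$, (12) gives

$$I[f_{\mathbf z}^\lambda] - I[f_{\mathcal H}] = \|\sqrt{T}(f_{\mathbf z}^\lambda - f_{\mathcal H})\|_{\mathcal H}^2.$$

As usual,

$$f_{\mathbf z}^\lambda - f_{\mathcal H} = (f_{\mathbf z}^\lambda - f^\lambda) + (f^\lambda - f_{\mathcal H})$$

and (13), (17) give

$$\begin{aligned}
f_{\mathbf z}^\lambda - f^\lambda &= (T_{\mathbf x} + \lambda)^{-1}A_{\mathbf z}^*\mathbf y - (T + \lambda)^{-1}A^*y \\
&= (T_{\mathbf x} + \lambda)^{-1}\left\{(A_{\mathbf z}^*\mathbf y - A^*y) + (T - T_{\mathbf x})(T + \lambda)^{-1}A^*y\right\} \\
(\text{Eq. } (11)) \quad &= (T_{\mathbf x} + \lambda)^{-1}\left\{(A_{\mathbf z}^*\mathbf y - T_{\mathbf x}f_{\mathcal H} + T_{\mathbf x}f_{\mathcal H} - Tf_{\mathcal H}) + (T - T_{\mathbf x})f^\lambda\right\} \\
&= (T_{\mathbf x} + \lambda)^{-1}(A_{\mathbf z}^*\mathbf y - T_{\mathbf x}f_{\mathcal H}) + (T_{\mathbf x} + \lambda)^{-1}(T - T_{\mathbf x})(f^\lambda - f_{\mathcal H}).
\end{aligned}$$

The inequality $\|f_1 + f_2 + f_3\|_{\mathcal H}^2 \le 3(\|f_1\|_{\mathcal H}^2 + \|f_2\|_{\mathcal H}^2 + \|f_3\|_{\mathcal H}^2)$ implies

$$I[f_{\mathbf z}^\lambda] - I[f_{\mathcal H}] \le 3\left(\mathcal A(\lambda) + \mathcal S_1(\lambda,\mathbf z) + \mathcal S_2(\lambda,\mathbf z)\right) \tag{23}$$

where

$$\mathcal S_1(\lambda,\mathbf z) = \left\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}(A_{\mathbf z}^*\mathbf y - T_{\mathbf x}f_{\mathcal H})\right\|_{\mathcal H}^2,$$

$$\mathcal S_2(\lambda,\mathbf z) = \left\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}(T - T_{\mathbf x})(f^\lambda - f_{\mathcal H})\right\|_{\mathcal H}^2.$$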
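Step 2: probabilistic bound on $\mathcal S_2(\lambda,\mathbf z)$. Clearly

$$\mathcal S_2(\lambda,\mathbf z) \le \left\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}\right\|_{\mathcal L(\mathcal H)}^2 \left\|(T - T_{\mathbf x})(f^\lambda - f_{\mathcal H})\right\|_{\mathcal H}^2. \tag{24}$$

Step 2.1: probabilistic bound on $\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}\|_{\mathcal L(\mathcal H)}$. Assume that

$$\Theta(\lambda,\mathbf z) = \left\|(T + \lambda)^{-1}(T - T_{\mathbf x})\right\|_{\mathcal L(\mathcal H)} \le \frac{1}{2}, \tag{25}$$

then the Neumann series gives

$$\sqrt{T}(T_{\mathbf x} + \lambda)^{-1} = \sqrt{T}(T + \lambda)^{-1}\left(I - (T + \lambda)^{-1}(T - T_{\mathbf x})\right)^{-1} = \sqrt{T}(T + \lambda)^{-1}\sum_{n=0}^{+\infty}\left((T + \lambda)^{-1}(T - T_{\mathbf x})\right)^n$$

so that

$$\left\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}\right\|_{\mathcal L(\mathcal H)} \le \left\|\sqrt{T}(T + \lambda)^{-1}\right\|_{\mathcal L(\mathcal H)}\sum_{n=0}^{+\infty}\left\|(T + \lambda)^{-1}(T - T_{\mathbf x})\right\|_{\mathcal L(\mathcal H)}^n \le \frac{1}{2\sqrt{\lambda}}\,\frac{1}{1 - \Theta(\lambda,\mathbf z)},$$

where, by the spectral theorem,

$$\left\|\sqrt{T}(T + \lambda)^{-1}\right\|_{\mathcal L(\mathcal H)} \le \frac{1}{2\sqrt{\lambda}}.$$

Inequality (25) now gives

$$\left\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}\right\|_{\mathcal L(\mathcal H)} \le \frac{1}{\sqrt{\lambda}}. \tag{26}$$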


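We claim that (22) implies (25) with probability greater than $1 - \eta/3$. Indeed, let $\mathcal L_2(\mathcal H)$ be the Hilbert space of Hilbert-Schmidt operators on $\mathcal H$ (recall that $\langle A, B\rangle_{\mathcal L_2(\mathcal H)} = \mathrm{Tr}[A^*B]$). Let us identify $\mathcal L_2(\mathcal H)$ with $\mathcal H \otimes \mathcal H$ and let $\xi_1 : X \to \mathcal L_2(\mathcal H)$ be the random variable

$$\xi_1(x) = \langle\cdot, K_x\rangle_{\mathcal H}(T + \lambda)^{-1}K_x = K_x \otimes (T + \lambda)^{-1}K_x.$$

Bound (5) and $\|(T + \lambda)^{-1}\| \le \frac{1}{\lambda}$ imply

$$\|\xi_1(x)\|_{\mathcal H\otimes\mathcal H} = \|K_x\|_{\mathcal H}\left\|(T + \lambda)^{-1}K_x\right\|_{\mathcal H} \le \frac{\kappa}{\lambda} = \frac{H_1}{2}$$

and

$$\begin{aligned}
E[\|\xi_1\|_{\mathcal H\otimes\mathcal H}^2] &= \int_X \|K_x\|_{\mathcal H}^2\left\|(T + \lambda)^{-1}K_x\right\|_{\mathcal H}^2 d\nu(x) \\
&\le \kappa\int_X \langle(T + \lambda)^{-2}K_x, K_x\rangle_{\mathcal H}\, d\nu(x) \\
&= \kappa\,\mathrm{Tr}[(T + \lambda)^{-2}T] \\
&\le \kappa\left\|(T + \lambda)^{-1}\right\|\mathrm{Tr}[(T + \lambda)^{-1}T] \\
&\le \frac{\kappa}{\lambda}\mathcal N(\lambda) = \sigma_1^2.
\end{aligned}$$

Observing that

$$E[\xi_1] = T(T + \lambda)^{-1}, \qquad \frac{1}{\ell}\sum_{i=1}^{\ell}\xi_1(x_i) = (T + \lambda)^{-1}T_{\mathbf x},$$

Proposition 4 applied to $\xi_1$ gives

$$\left\|(T + \lambda)^{-1}T_{\mathbf x} - T(T + \lambda)^{-1}\right\|_{\mathcal L_2(\mathcal H)} \le 2\log(6/\eta)\left(\frac{2\kappa}{\lambda\ell} + \sqrt{\frac{\kappa\mathcal N(\lambda)}{\lambda\ell}}\right)$$

with probability greater than $1 - \eta/3$. Then for all $\ell \in \mathbb N$ satisfying (22)

$$\log(6/\eta)\left(\frac{2\kappa}{\lambda\ell} + \sqrt{\frac{\kappa\mathcal N(\lambda)}{\lambda\ell}}\right) \le \frac{1}{8} + \frac{1}{8} \le \frac{1}{4}$$

so that

$$\Theta(\lambda,\mathbf z) \le \left\|(T + \lambda)^{-1}T_{\mathbf x} - T(T + \lambda)^{-1}\right\|_{\mathcal L_2(\mathcal H)} \le \frac{1}{2} \tag{27}$$

with probability greater than $1 - \eta/3$.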

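Step 2.2: probabilistic bound on $\|(T - T_{\mathbf x})(f^\lambda - f_{\mathcal H})\|_{\mathcal H}$. Let $\xi_2 : X \to \mathcal H$ be the random variable

$$\xi_2(x) = \langle f^\lambda - f_{\mathcal H}, K_x\rangle_{\mathcal H} K_x.$$

Bound (5) and the definition of $\mathcal B(\lambda)$ give

$$\|\xi_2(x)\|_{\mathcal H} \le \|K_x\|_{\mathcal H}^2\,\|f^\lambda - f_{\mathcal H}\|_{\mathcal H} \le \kappa\sqrt{\mathcal B(\lambda)} = \frac{H_2}{2}$$

and

$$\begin{aligned}
E[\|\xi_2\|_{\mathcal H}^2] &= \int_X \|K_x\|_{\mathcal H}^2\,\langle f^\lambda - f_{\mathcal H}, K_x\rangle_{\mathcal H}^2\, d\nu(x) \\
&\le \kappa\,\langle T(f^\lambda - f_{\mathcal H}), f^\lambda - f_{\mathcal H}\rangle_{\mathcal H} \\
&= \kappa\left\|\sqrt{T}(f^\lambda - f_{\mathcal H})\right\|_{\mathcal H}^2 = \kappa\mathcal A(\lambda) = \sigma_2^2.
\end{aligned}$$

Observing that

$$E[\xi_2] = T(f^\lambda - f_{\mathcal H}), \qquad \frac{1}{\ell}\sum_{i=1}^{\ell}\xi_2(x_i) = T_{\mathbf x}(f^\lambda - f_{\mathcal H}),$$

Proposition 4 applied to $\xi_2$ gives

$$\left\|(T - T_{\mathbf x})(f^\lambda - f_{\mathcal H})\right\|_{\mathcal H} \le 2\log(6/\eta)\left(\frac{2\kappa\sqrt{\mathcal B(\lambda)}}{\ell} + \sqrt{\frac{\kappa\mathcal A(\lambda)}{\ell}}\right) \tag{28}$$

with probability greater than $1 - \eta/3$. Replacing (26), (28) in (24), for all $\ell \in \mathbb N$ satisfying (22) it holds

$$\mathcal S_2(\lambda,\mathbf z) \le 8\log^2(6/\eta)\left(\frac{4\kappa^2\mathcal B(\lambda)}{\ell^2\lambda} + \frac{\kappa\mathcal A(\lambda)}{\ell\lambda}\right) \tag{29}$$

with probability greater than $1 - 2\eta/3$.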

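Step 3: probabilistic bound on $\mathcal S_1(\lambda,\mathbf z)$. Clearly

$$\mathcal S_1(\lambda,\mathbf z) \le \left\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}(T + \lambda)^{\frac12}\right\|_{\mathcal L(\mathcal H)}^2\left\|(T + \lambda)^{-\frac12}(A_{\mathbf z}^*\mathbf y - T_{\mathbf x}f_{\mathcal H})\right\|_{\mathcal H}^2. \tag{30}$$

Step 3.1: bound on $\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}(T + \lambda)^{\frac12}\|_{\mathcal L(\mathcal H)}$. Clearly

$$\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}(T + \lambda)^{\frac12} = \sqrt{T}(T + \lambda)^{-\frac12}\left\{I - (T + \lambda)^{-\frac12}(T - T_{\mathbf x})(T + \lambda)^{-\frac12}\right\}^{-1}.$$

The spectral theorem ensures that $\|\sqrt{T}(T + \lambda)^{-\frac12}\|_{\mathcal L(\mathcal H)} \le 1$ so, reasoning as in Step 2.1,

$$\left\|\sqrt{T}(T_{\mathbf x} + \lambda)^{-1}(T + \lambda)^{\frac12}\right\|_{\mathcal L(\mathcal H)} \le 2 \tag{31}$$

provided that

$$\left\|(T + \lambda)^{-\frac12}(T - T_{\mathbf x})(T + \lambda)^{-\frac12}\right\|_{\mathcal L(\mathcal H)} \le \frac{1}{2}. \tag{32}$$

If $B = (T + \lambda)^{-\frac12}(T - T_{\mathbf x})(T + \lambda)^{-\frac12}$, then

$$\begin{aligned}
\|B\|_{\mathcal L(\mathcal H)}^2 \le \|B\|_{\mathcal L_2(\mathcal H)}^2 &= \mathrm{Tr}\left((T + \lambda)^{-1}(T - T_{\mathbf x})(T + \lambda)^{-1}(T - T_{\mathbf x})\right) \\
&= \left\langle(T + \lambda)^{-1}(T - T_{\mathbf x}), \left((T + \lambda)^{-1}(T - T_{\mathbf x})\right)^*\right\rangle_{\mathcal L_2(\mathcal H)} \\
&\le \left\|(T + \lambda)^{-1}(T - T_{\mathbf x})\right\|_{\mathcal L_2(\mathcal H)}\left\|\left((T + \lambda)^{-1}(T - T_{\mathbf x})\right)^*\right\|_{\mathcal L_2(\mathcal H)} \\
&= \left\|(T + \lambda)^{-1}(T - T_{\mathbf x})\right\|_{\mathcal L_2(\mathcal H)}^2
\end{aligned}$$

and, for all $\ell \in \mathbb N$ satisfying (22), (27) ensures that (32) holds with probability $1 - 2\eta/3$.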


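Step 3.2: bound on $\|(T + \lambda)^{-\frac12}(A_{\mathbf z}^*\mathbf y - T_{\mathbf x}f_{\mathcal H})\|_{\mathcal H}$. Let $\xi_3 : X \times Y \to \mathcal H$ be the random variable

$$\xi_3(x,y) = (T + \lambda)^{-\frac12}K_x\,(y - f_{\mathcal H}(x)).$$

The definition of $M$ gives

$$\|\xi_3(x,y)\|_{\mathcal H} \le \left\|(T + \lambda)^{-\frac12}\right\|_{\mathcal L(\mathcal H)}\|K_x\|_{\mathcal H}\sqrt{M} \le \sqrt{\frac{\kappa M}{\lambda}} = \frac{H_3}{2}$$

almost surely, and

$$\begin{aligned}
E[\|\xi_3\|_{\mathcal H}^2] &= \int_{X\times Y}(y - f_{\mathcal H}(x))^2\left\|(T + \lambda)^{-\frac12}K_x\right\|_{\mathcal H}^2 d\rho(x,y) \\
&\le M\int_X\langle(T + \lambda)^{-1}K_x, K_x\rangle_{\mathcal H}\, d\nu(x) \\
&= M\,\mathrm{Tr}[(T + \lambda)^{-1}T] = M\mathcal N(\lambda) = \sigma_3^2.
\end{aligned}$$

Equation (11) gives

$$E[\xi_3] = (T + \lambda)^{-\frac12}(A^*y - Tf_{\mathcal H}) = 0,$$

so Proposition 4 applied to $\xi_3$ ensures

$$\left\|(T + \lambda)^{-\frac12}(A_{\mathbf z}^*\mathbf y - T_{\mathbf x}f_{\mathcal H})\right\|_{\mathcal H} \le 2\log(6/\eta)\left(\frac{2}{\ell}\sqrt{\frac{\kappa M}{\lambda}} + \sqrt{\frac{M\mathcal N(\lambda)}{\ell}}\right) \tag{33}$$

with probability greater than $1 - \eta/3$. Replacing (31), (33) in (30),

$$\mathcal S_1(\lambda,\mathbf z) \le 32\log^2(6/\eta)\left(\frac{4\kappa M}{\ell^2\lambda} + \frac{M\mathcal N(\lambda)}{\ell}\right) \tag{34}$$

with probability greater than $1 - \eta$.

Replacing bounds (29), (34) in (23),

$$I[f_{\mathbf z}^\lambda] - I[f_{\mathcal H}] \le 3\mathcal A(\lambda) + 8\log^2(6/\eta)\left(\frac{4\kappa^2\mathcal B(\lambda)}{\ell^2\lambda} + \frac{\kappa\mathcal A(\lambda)}{\ell\lambda} + \frac{16\kappa M}{\ell^2\lambda} + \frac{4M\mathcal N(\lambda)}{\ell}\right)$$

and (21) follows by bounding the numerical constants with 128.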
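To see how the five terms of the bound trade off, it can help to evaluate the right-hand side of (21) for given values of $\mathcal A(\lambda)$, $\mathcal B(\lambda)$ and $\mathcal N(\lambda)$. The following sketch assumes the five-term form of (21) with $C_\eta = 128\log^2(8/\eta)$; the function names and the sample values are hypothetical, and the constraint (22) is deliberately not checked here.

```python
import math

def c_eta(eta):
    """C_eta = 128 log^2(8 / eta)."""
    return 128.0 * math.log(8.0 / eta) ** 2

def excess_risk_bound(A, B, N, ell, lam, kappa, M, eta):
    """Right-hand side of (21) for given A(lam), B(lam), N(lam)."""
    terms = (
        A,                                  # residual
        kappa**2 * B / (ell**2 * lam),      # reconstruction error term
        kappa * A / (ell * lam),
        kappa * M / (ell**2 * lam),
        M * N / ell,                        # effective dimension (sampling) term
    )
    return c_eta(eta) * sum(terms)

# Hypothetical values, only to show the call.
bound = excess_risk_bound(A=0.1, B=1.0, N=5.0, ell=1000, lam=0.1,
                          kappa=1.0, M=1.0, eta=0.1)
```

For large $\ell$ the first and last terms dominate, which is the simplification used later in the proof of Theorem 7.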


5  A  priori  regularization  parameter  choice 


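In this section we discuss the choice of the parameter $\lambda = \lambda_\ell$ as a function of the number of examples $\ell$ in such a way as to obtain a maximal rate of convergence.

The following lemma studies the dependence of $\mathcal A(\lambda)$, $\mathcal B(\lambda)$ and $\mathcal N(\lambda)$ on $\lambda$. We let $N$ be the dimension of $\mathcal H$ (possibly $N = +\infty$) and

$$T = \sum_{n=1}^{N} t_n\langle\cdot, e_n\rangle_{\mathcal H}\, e_n$$

be the spectral decomposition of $T$ with $0 < t_{n+1} \le t_n$ and $(e_n)_{n=1}^{N}$ a basis of $\mathcal H$.

Lemma 6 With the above notations,

$$\lim_{\lambda\to 0}\mathcal N(\lambda) = N, \tag{35}$$

in particular if $N = +\infty$ and $t_n = O(n^{-b})$ for some $b > 1$

$$\mathcal N(\lambda) = O(\lambda^{-\frac1b}). \tag{36}$$

Moreover, if $f_{\mathcal H} \in \mathrm{Im}\, T^a$ with $0 < a \le 1/2$, then

$$\mathcal A(\lambda) \le \lambda^{2a+1}\|T^{-a}f_{\mathcal H}\|_{\mathcal H}^2 \tag{37}$$

and

$$\mathcal B(\lambda) \le \lambda^{2a}\|T^{-a}f_{\mathcal H}\|_{\mathcal H}^2. \tag{38}$$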

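PROOF. Firstly we study $\mathcal N(\lambda)$. Since

$$\mathcal N(\lambda) = \sum_{n=1}^{N}\frac{t_n}{t_n + \lambda}, \tag{39}$$

clearly $\lim_{\lambda\to 0}\mathcal N(\lambda) = N$. Assume now that $N = +\infty$ and $t_n = O(n^{-b})$ with $b > 1$. Since, by (39), $\mathcal N(\lambda)$ is an increasing function of $t_n$ for every $n$, without loss of generality we can consider the case $t_n = n^{-b}$. The sequence $(t_n)_{n\in\mathbb N}$ is strictly positive and decreasing so, by the integral test, $\mathcal N(\lambda)$ has the same behavior as

$$\int_1^{+\infty}\frac{1}{1 + t^b\lambda}\, dt = \lambda^{-\frac1b}\int_{\lambda^{1/b}}^{+\infty}\frac{1}{1 + \tau^b}\, d\tau \le \lambda^{-\frac1b}\int_0^{+\infty}\frac{1}{1 + \tau^b}\, d\tau,$$

where we substituted $\tau^b = t^b\lambda$, so that $\mathcal N(\lambda) = O(\lambda^{-\frac1b})$.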


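The results about $\mathcal A(\lambda)$ and $\mathcal B(\lambda)$ are standard [10, 8]. We give the proof for completeness. Equation (13) gives

$$\mathcal B(\lambda) = \left\|(T + \lambda)^{-1}Tf_{\mathcal H} - f_{\mathcal H}\right\|_{\mathcal H}^2 = \left\|\lambda(T + \lambda)^{-1}f_{\mathcal H}\right\|_{\mathcal H}^2 = \sum_{n=1}^{N}\left(\frac{\lambda}{t_n + \lambda}\right)^2|\langle f_{\mathcal H}, e_n\rangle_{\mathcal H}|^2. \tag{40}$$

Assume now that $f_{\mathcal H} \in \mathrm{Im}\, T^a$ with $0 < a \le 1/2$. Since the function $x^a$ is concave on $]0, +\infty[$,

$$\left(\frac{t_n}{\lambda}\right)^a \le 1 + a\frac{t_n}{\lambda} \le 1 + \frac{t_n}{\lambda},$$

so that

$$\frac{\lambda}{t_n + \lambda} = \frac{1}{1 + \frac{t_n}{\lambda}} \le \frac{\lambda^a}{t_n^a};$$

then, replacing the above inequality in (40),

$$\mathcal B(\lambda) \le \lambda^{2a}\|T^{-a}f_{\mathcal H}\|_{\mathcal H}^2.$$

Reasoning as above we obtain

$$\begin{aligned}
\mathcal A(\lambda) &= \left\|\sqrt{T}\left((T + \lambda)^{-1}Tf_{\mathcal H} - f_{\mathcal H}\right)\right\|_{\mathcal H}^2 \\
&= \sum_{n=1}^{N} t_n\left(\frac{\lambda}{t_n + \lambda}\right)^2|\langle f_{\mathcal H}, e_n\rangle_{\mathcal H}|^2 \\
&\le \sum_{n=1}^{N} t_n\frac{\lambda^{2a+1}}{t_n^{2a+1}}|\langle f_{\mathcal H}, e_n\rangle_{\mathcal H}|^2 \\
&= \lambda^{2a+1}\sum_{n=1}^{N}\frac{1}{t_n^{2a}}|\langle f_{\mathcal H}, e_n\rangle_{\mathcal H}|^2 = \lambda^{2a+1}\|T^{-a}f_{\mathcal H}\|_{\mathcal H}^2,
\end{aligned}$$

where the inequality follows since the function $x^{a+\frac12}$ is concave. □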
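The three bounds of Lemma 6 can be checked numerically on a diagonal model. The sketch below is an illustration under assumptions not in the text: the spectrum $t_n = n^{-b}$ is truncated to finitely many terms, $f_{\mathcal H} = T^a g$ for a randomly drawn $g$, and the implicit constant in (36) is made explicit through the standard value $\int_0^{+\infty}\frac{d\tau}{1+\tau^b} = \frac{\pi/b}{\sin(\pi/b)}$ for $b > 1$.

```python
import numpy as np

b, a = 2.0, 0.25
n = np.arange(1, 200001, dtype=float)
t = n ** (-b)                          # truncated spectrum t_n = n^{-b}

rng = np.random.default_rng(0)
g = rng.normal(size=n.size) / n        # coefficients of some g in H
f_H = t**a * g                         # f_H = T^a g, so T^{-a} f_H = g
norm_g2 = np.sum(g**2)

integral = (np.pi / b) / np.sin(np.pi / b)   # int_0^inf dtau / (1 + tau^b), b > 1

checks = []
for lam in (1e-1, 1e-2, 1e-3):
    shrink = lam / (t + lam)                       # eigenvalues of I - (T + lam)^{-1} T
    N_lam = np.sum(t / (t + lam))                  # (39)
    B_lam = np.sum(shrink**2 * f_H**2)             # (40)
    A_lam = np.sum(t * shrink**2 * f_H**2)
    checks.append((
        N_lam <= integral * lam ** (-1.0 / b),     # (36), with an explicit constant
        A_lam <= lam ** (2 * a + 1) * norm_g2,     # (37)
        B_lam <= lam ** (2 * a) * norm_g2,         # (38)
    ))
```

All entries of `checks` should be true. Truncating the spectrum only lowers $\mathcal N(\lambda)$, so the first check remains valid for the truncated sum.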


The following theorem analyzes the a priori choice for the regularization parameter and the corresponding rate of convergence to the regression function in a suitable prior class. We use the stochastic order symbol $O_P$ defined [22] by the equivalence

$$X_n = O_P(k_n) \iff \lim_{T\to\infty}\limsup_{n\to\infty} P\left[|X_n| > Tk_n\right] = 0,$$

where $(X_n)_{n\in\mathbb N}$ is a sequence of random variables and $(k_n)_{n\in\mathbb N}$ a sequence of positive numbers.


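Theorem 7 Assume $f_{\mathcal H} \in \mathrm{Im}\, T^a$ with $0 < a \le 1/2$ and for $\ell \in \mathbb N$ let $\lambda_\ell$ be the unique solution of the equation

$$\ell\lambda_\ell^c = \mathcal N(\lambda_\ell), \tag{41}$$

where $c = 2a + 1$. Then, with the assumptions of Theorem 5,

if $N < +\infty$, then $I[f_{\mathbf z}^{\lambda_\ell}] - I[f_{\mathcal H}] = O_P(\ell^{-1})$;

if $N = +\infty$ and $t_n = O(n^{-b})$, then $I[f_{\mathbf z}^{\lambda_\ell}] - I[f_{\mathcal H}] = O_P\left(\ell^{-\frac{bc}{bc+1}}\right)$,

where $b > 1$.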
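Equation (41) has no closed form in general, but since $\ell\lambda^c$ increases in $\lambda$ while $\mathcal N(\lambda)$ decreases, it can be solved by bisection. The following sketch does this for a truncated spectrum $t_n = n^{-b}$. The truncation level, the bracket and the function names are assumptions made for illustration, not part of the theorem.

```python
import numpy as np

def effective_dimension(eigs, lam):
    """N(lam) = sum_n t_n / (t_n + lam)."""
    return float(np.sum(eigs / (eigs + lam)))

def solve_lambda(eigs, ell, c, tol=1e-12):
    """Solve ell * lam**c = N(lam) by bisection on a logarithmic scale."""
    h = lambda lam: ell * lam**c - effective_dimension(eigs, lam)
    lo, hi = 1e-16, 1.0
    while h(hi) < 0.0:       # enlarge the bracket until the sign changes
        hi *= 2.0
    while hi / lo > 1.0 + tol:
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if h(mid) < 0.0 else (lo, mid)
    return np.sqrt(lo * hi)

a, b = 0.5, 2.0
c = 2 * a + 1
eigs = np.arange(1, 100001, dtype=float) ** (-b)   # truncated t_n = n^{-b}
ells = [10**2, 10**3, 10**4]
lams = [solve_lambda(eigs, ell, c) for ell in ells]
rates = [lam**c for lam in lams]                   # compare with ell ** (-b*c / (b*c + 1))
```

With these values $bc/(bc+1) = 4/5$, so `rates` can be compared with $\ell^{-4/5}$ up to a constant factor. The truncation matters only when $\lambda_\ell$ approaches the smallest retained eigenvalue.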


PROOF. 

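First of all, from representation (39), we note that $\mathcal N(\lambda)$ is a positive, non-increasing continuous function of $\lambda$. Then it is clear that equation (41) has a unique non-negative solution $\lambda_\ell$, and $\lim_{\ell\to\infty}\lambda_\ell = 0$.

Now let us fix $\eta \in (0,1)$, and note that since by assumption $c > 1$, there exists $\ell(\eta) \in \mathbb N$ such that $\ell \ge \ell(\eta)$ implies

$$\ell\lambda_\ell = \lambda_\ell^{1-c}\mathcal N(\lambda_\ell) \ge C_\eta\,\kappa\,\max\left(\mathcal N(\lambda_\ell), \sqrt{2/C_\eta}\right), \tag{42}$$

where we followed the notation of Theorem 5. Since the inequality above is equivalent to the constraint (22), Theorem 5 can be applied, giving

$$P\left[|X_\ell| > C_\eta\left(\mathcal A(\lambda_\ell) + \frac{\kappa^2\mathcal B(\lambda_\ell)}{\ell^2\lambda_\ell} + \frac{\kappa\mathcal A(\lambda_\ell)}{\ell\lambda_\ell} + \frac{\kappa M}{\ell^2\lambda_\ell} + \frac{M\mathcal N(\lambda_\ell)}{\ell}\right)\right] \le \eta, \qquad \forall\ell \ge \ell(\eta),$$

where we introduced the sequence of random variables $X_\ell = I[f_{\mathbf z}^{\lambda_\ell}] - I[f_{\mathcal H}]$.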


We can simplify the form of the upper bound above recalling that, by Lemma 6, $\mathcal B(\lambda_\ell) = O(1)$, which implies $\mathcal B(\lambda_\ell)\ell^{-2}\lambda_\ell^{-1} = O(\ell^{-2}\lambda_\ell^{-1})$, so that the fourth term in the sum above is not asymptotically smaller than the second one. Reasoning in a similar way, from inequality (42) we see that $\ell^{-1}\lambda_\ell^{-1} = O(1)$, so that $\mathcal A(\lambda_\ell)\ell^{-1}\lambda_\ell^{-1} = O(\mathcal A(\lambda_\ell))$ and $\ell^{-2}\lambda_\ell^{-1} = O(\mathcal N(\lambda_\ell)\ell^{-1})$, which means that the first and fifth terms are not asymptotically smaller than the third and fourth ones respectively. These arguments on the asymptotic orders of the terms in the probabilistic bound above lead to the conclusion that a positive constant $C'$ and a natural number $\ell'(\eta)$ exist such that

$$P\left[|X_\ell| > C_\eta C'\left(\mathcal A(\lambda_\ell) + \frac{\mathcal N(\lambda_\ell)}{\ell}\right)\right] \le \eta, \qquad \forall\ell \ge \ell'(\eta). \tag{43}$$


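The bound above can be restated in terms of stochastic order symbols. In fact, since $\eta$ was arbitrarily chosen in the interval $(0,1)$, from the definition of the constants $C_\eta$ it follows that

$$P\left[|X_\ell| > T\left(\mathcal A(\lambda_\ell) + \frac{\mathcal N(\lambda_\ell)}{\ell}\right)\right] \le 8e^{-\sqrt{T/128C'}}, \qquad \forall T \ge 128C'\log^2 8, \quad \forall\ell \ge \ell''(T), \tag{44}$$

and from the definition of the stochastic order symbol this implies

$$I[f_{\mathbf z}^{\lambda_\ell}] - I[f_{\mathcal H}] = O_P\left(\mathcal A(\lambda_\ell) + \frac{\mathcal N(\lambda_\ell)}{\ell}\right). \tag{45}$$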


In order to complete the proof we now have to estimate the asymptotic order of the sequence of positive numbers appearing on the r.h.s. of the equality above. This can be accomplished by bounding by $O(\lambda_\ell^c)$ both the approximation term $\mathcal A(\lambda_\ell)$ and the sampling term $\mathcal N(\lambda_\ell)/\ell$. The first estimate is obtained from inequality (37) in Lemma 6, the second one from the very definition of $\lambda_\ell$, equation (41). Then we can further simplify the previous expression for the stochastic order of $I[f_{\mathbf z}^{\lambda_\ell}] - I[f_{\mathcal H}]$; in fact we have

$$I[f_{\mathbf z}^{\lambda_\ell}] - I[f_{\mathcal H}] = O_P(\lambda_\ell^c). \tag{46}$$

Finally let us consider separately the two cases in the statement of the theorem. If $N < +\infty$, by equation (35) in Lemma 6 and equation (41), it is clear that $\lambda_\ell^c = O(\ell^{-1})$, which proves the first statement of the theorem. If instead $N = +\infty$ and $t_n = O(n^{-b})$ for some $b > 1$, this time from (36) in Lemma 6 and equation (41) it follows that

$$\ell\lambda_\ell^c = O(\lambda_\ell^{-\frac1b}),$$

which implies $\lambda_\ell^c = O\left(\ell^{-\frac{bc}{bc+1}}\right)$, proving the second part of the theorem.
6  Asymptotic  Optimality


In this section we prove that the asymptotic power rates obtained above are indeed optimal. These rates were derived assuming that the eigenvalues of the operator $T$ fulfill the power upper bound $t_n = O(n^{-b})$.

On the other hand, the main result of this section relies on a power lower bound for the same eigenvalues, $t_n = \Omega(n^{-b})$. It is shown that under this assumption a lower bound of the same form as the previous result can be established for the asymptotic rate of convergence of the risk.

These two results allow us to conclude that in the case $t_n = \Theta(n^{-b})$, that is, if the eigenvalues of $T$ have a power-law asymptotic behavior, the asymptotic rate of convergence for the risk of RLS estimators obtained by the described choice of the regularization parameter is optimal conditionally on the marginal distribution $\rho_X$ and the prior class for $f_\rho$.

We now state and prove the theorem just introduced. Let us first denote by $\mathcal{M}(\mathcal{F}, \rho_X)$ the set of probability distributions $\rho$ on $X \times Y$ which realize the marginal probability distribution $\rho_X$ and are compatible with the prior condition $f_\rho \in \mathcal{F}$. The theorem is based on a recent result, [6], expressed in terms of the tight packing numbers of the prior class $\mathcal{F} \subset L^2(X, \rho_X)$,

\[ \mathcal{N}(\mathcal{F}, \rho_X, \delta, c_0, c_1) := \sup\{ k \mid \exists\, (g_i)_{i=1}^{k} \in \mathcal{F}^k \ \text{s.t.}\ \forall\, i \neq j,\ \ c_0\delta \le \|g_i - g_j\|_\rho \le c_1\delta \}, \]

where $0 < c_0 < c_1 < \infty$ are two fixed real numbers. The main lower bound in ([6], Theorem 3.1) can be restated in the following form.

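Theorem 8 Define $\mathcal{N}(\epsilon) := \mathcal{N}(\mathcal{F}, \rho_X, 2\sqrt{\epsilon}/c_0, c_0, c_1)$. Suppose that for $\epsilon > 0$ the net of functions $(g_i)_{i=1}^{\mathcal{N}(\epsilon)}$ (occurring in the definition of tight packing numbers) satisfies $\|g_i\|_{C(X)} \le 1/4$ for $i = 1, \dots, \mathcal{N}(\epsilon)$. Let $\eta := e^{-3/e}$; then for all $\ell \in \mathbb{N}$

\[ \inf_{f_\cdot}\ \sup_{\rho \in \mathcal{M}(\mathcal{F},\rho_X)} P_{\mathbf z \sim \rho^\ell}\Big[ I[f_{\mathbf z}] > I[f_\rho] + \epsilon \Big] \ \ge\ \min\Big( \frac{1}{2},\ \eta \sqrt{\mathcal{N}(\epsilon)}\, \exp\big(-8\ell\epsilon c_1^2/c_0^2\big) \Big), \]

where the infimum is over the set of all the learning maps $\mathbf z \mapsto f_{\mathbf z}$.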

We now specialize this general result to our framework, that is, we consider the two-parameter family of prior classes

\[ \mathcal{F}(a,R) := \{ f \in \mathcal{H} \mid T^{-a} f \in \mathcal{H} \ \text{with}\ \|T^{-a} f\|_{\mathcal H} \le R \} \tag{47} \]

for  0  <  a  <  1/2  and  R  >  0. 

The lower bound on the asymptotic rate of the risk for general learning maps under these priors and fixed marginal distribution $\rho_X$ can be stated as follows.

Theorem 9 Let us assume that $\dim \mathcal{H} = +\infty$. Moreover let the eigenvalues $(t_n)_{n=1}^{+\infty}$ of the operator $T$ fulfill the condition $t_n = \Omega(n^{-b})$ for some $b > 0$. Then for every $0 < a < 1/2$ and $R > 0$ there exist $\bar\ell$ and $C > 0$ such that for every $\ell \ge \bar\ell$

\[ \inf_{f_\cdot}\ \sup_{\rho \in \mathcal{M}(\mathcal{F},\rho_X)} P_{\mathbf z \sim \rho^\ell}\Big[ I[f_{\mathbf z}] > I[f_\rho] + C\,\ell^{-\frac{bc}{bc+1}} \Big] \ \ge\ \eta, \]

where $\mathcal{F} := \mathcal{F}(a,R)$, $c := 2a + 1$ and $\eta := e^{-3/e}$.


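The proof of Theorem 9 relies on the following lemma regarding packing numbers over sets of binary strings endowed with the Hamming distance, which we normalize so that each differing coordinate contributes 2:

\[ d_H(\sigma,\sigma') := \sum_{i=1}^{K} |\sigma_i - \sigma'_i| = 2\,|\{ 1 \le i \le K \ \text{s.t.}\ \sigma_i \neq \sigma'_i \}|, \]

where we adopted the notation $\sigma := (\sigma_i)_{i=1}^{K} \in \{-1,+1\}^K$. The proof relies on a standard concentration of measure result, Hoeffding's inequality [12].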

Lemma 10 For every $K \ge 17$ there exists a subset $L$ of the cube $\{-1,+1\}^K$ such that

\[ d_H(\sigma,\sigma') \ge \frac{K}{2} \qquad \forall\, \sigma, \sigma' \in L,\ \sigma \neq \sigma', \]

and $|L| \ge B^K$, where $B := \exp(1/24)$.


PROOF. If $\sigma := (\sigma_i)_{i=1}^{K}$ is a random point on the cube, then the components $\sigma_i$ are independent random variables distributed according to the measure $\frac{1}{2}(\delta_{-1} + \delta_{+1})$. Let $\sigma$ and $\sigma'$ be independent random points on the cube; then note that

\[ d_H(\sigma,\sigma') = \sum_{i=1}^{K} |\sigma_i - \sigma'_i| = \sum_{i=1}^{K} \theta_i, \]

where the $\theta_i$ are independent random variables distributed according to the measure $\frac{1}{2}(\delta_0 + \delta_2)$. It follows that Hoeffding's inequality can be applied to the random variable $d_H(\sigma,\sigma')$, which has mean $K$, yielding for every $\delta > 0$

\[ P\big[\, |d_H(\sigma,\sigma') - K| \ge \delta \,\big] \le 2\exp\Big(-\frac{\delta^2}{2K}\Big). \]

Setting $\delta = K/2$ in the inequality above, we obtain

\[ P\Big[\, d_H(\sigma,\sigma') \le \frac{K}{2} \,\Big] \le 2\exp\Big(-\frac{K}{8}\Big). \tag{48} \]
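Inequality (48) can be checked against the exact tail probability. The sketch below assumes the normalization $d_H(\sigma,\sigma') = \sum_i |\sigma_i - \sigma'_i|$ used in this proof, so that $d_H$ is twice a Binomial$(K, 1/2)$ count; the function names are hypothetical.

```python
import math

def exact_tail(K):
    # d_H = sum_i |sigma_i - sigma'_i| equals 2 * (number of differing
    # coordinates), and that number is Binomial(K, 1/2) for independent
    # uniform points. Hence P[d_H <= K/2] = P[Binomial(K, 1/2) <= K/4].
    threshold = K // 4  # largest integer j with 2*j <= K/2
    favourable = sum(math.comb(K, j) for j in range(threshold + 1))
    return favourable / 2 ** K

def hoeffding_bound(K):
    # Right-hand side of inequality (48).
    return 2.0 * math.exp(-K / 8.0)

if __name__ == "__main__":
    for K in (17, 48, 96):
        print(K, exact_tail(K), hoeffding_bound(K))
```

The exact tail is typically much smaller than the Hoeffding bound, which is all the argument needs.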


Now draw $m := \lceil B^K \rceil$ (where $\lceil x \rceil$ is the smallest integer not less than $x$) independent random points $\sigma^{(j)}$ ($j = 1, \dots, m$) on the cube. From inequality (48), by the union bound it holds

\[ P\Big[\, \exists\, 1 \le j,k \le m,\ j \neq k,\ \text{with}\ d_H(\sigma^{(j)},\sigma^{(k)}) \le \frac{K}{2} \,\Big] \le (m^2 - m)\exp\Big(-\frac{K}{8}\Big) \le (B^{2K} + B^{K} + 1)\,B^{-3K} < 1, \]

where we used the inequality $x \le \lceil x \rceil < x + 1$, the identity $\exp(-K/8) = B^{-3K}$ and the assumption $K \ge 17$. The last inequality implies that at least one subset of the cube with the required properties exists.
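The probabilistic argument above can be turned into a simple randomized search. The following sketch is illustrative only: it assumes the normalization $d_H(\sigma,\sigma') = \sum_i |\sigma_i - \sigma'_i|$ from the proof, redraws until a valid configuration is found, and uses hypothetical function names.

```python
import math
import random

B = math.exp(1.0 / 24.0)

def l1_distance(s, t):
    # d_H as used in the proof: each differing coordinate contributes 2.
    return sum(abs(a - b) for a, b in zip(s, t))

def union_bound_value(K):
    # (B^{2K} + B^K + 1) * B^{-3K}, the quantity shown to be < 1.
    return (B ** (2 * K) + B ** K + 1.0) * B ** (-3 * K)

def draw_packing(K, rng, max_tries=1000):
    # Draw m = ceil(B^K) independent uniform points and accept the draw
    # only if every pair is at distance >= K/2; otherwise redraw.
    m = math.ceil(B ** K)
    for _ in range(max_tries):
        points = [tuple(rng.choice((-1, 1)) for _ in range(K))
                  for _ in range(m)]
        if all(l1_distance(points[j], points[k]) >= K / 2
               for j in range(m) for k in range(j + 1, m)):
            return points
    raise RuntimeError("no valid draw found; increase max_tries")

if __name__ == "__main__":
    packing = draw_packing(48, random.Random(0))
    print(len(packing), "points, union bound value:", union_bound_value(48))
```

Because the failure probability of a single draw is bounded away from 1 for $K \ge 17$, the retry loop is expected to stop after few attempts.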


We  are  now  ready  to  prove  the  main  result  of  this  section. 


PROOF. [Theorem 9] The proof consists in exhibiting a large subset $\mathcal{L}_\epsilon$ of the prior class $\mathcal{F} := \mathcal{F}(a,R)$ defined in (47) whose tight packing numbers can be easily estimated. Theorem 8 will then be applied directly to $\mathcal{L}_\epsilon$, yielding the claimed lower bound.

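First define $R' := \min\{R, \kappa^{-c}/4\}$; obviously

\[ \mathcal{F}(a,R') \subset \mathcal{F}(a,R). \]

Moreover for every $f$ in $\mathcal{F}(a,R')$ it holds

\[ \|f\|_{C(X)} \le \kappa\,\|f\|_{\mathcal H} \le \kappa\,\|T^{a}\|\,\|T^{-a} f\|_{\mathcal H} \le \kappa^{c} R' \le \frac{1}{4}, \tag{49} \]

where we used the fact that $\|T\| \le \kappa^2$, hence $\|T^a\| \le \kappa^{2a} = \kappa^{c-1}$. This bound will be useful when applying Theorem 8 to the subset $\mathcal{L}_\epsilon$ of $\mathcal{F}(a,R')$.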


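Let us recall that

\[ T = \sum_{n=1}^{+\infty} t_n \langle \cdot, e_n \rangle_{\mathcal H}\, e_n \]

is the spectral decomposition of $T$, with $0 < t_{n+1} \le t_n$ and $(e_n)_{n=1}^{+\infty}$ a basis of $\mathcal{H}$. Then by definition (47) it is clear that

\[ \mathcal{F}(a,R') = \Big\{ \sum_{n=1}^{+\infty} c_n e_n \ \Big|\ \sum_{n=1}^{+\infty} c_n^2\, t_n^{-2a} \le R'^2 \Big\}. \tag{50} \]

Since by assumption $t_n = \Omega(n^{-b})$, it holds

\[ \exists\, N', C' > 0 \ \text{s.t.}\ \forall\, n \ge N' \quad t_n \ge C' n^{-b}. \tag{51} \]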

Now let us set the constants $f := R'^{\frac{2}{bc}}\, C'^{\frac{1}{b}}$ and $\bar\epsilon := (f/(2N' + 34))^{bc}$, and define the family of sets

\[ \mathcal{D}_\epsilon := \{ n \in \mathbb{N} \mid N' < n \le f\,\epsilon^{-\frac{1}{bc}} \}, \qquad 0 < \epsilon \le \bar\epsilon. \tag{52} \]

From this definition and property (51) it follows that

\[ t_n^{c} \ge \epsilon R'^{-2}, \qquad \forall\, n \in \mathcal{D}_\epsilon. \tag{53} \]

Furthermore, due to the constraint on the allowed values of $\epsilon$ in (52), the cardinalities of the sets $\mathcal{D}_\epsilon$ fulfill the inequality


\[ |\mathcal{D}_\epsilon| \ge \max\Big\{ 17,\ \frac{f}{2}\,\epsilon^{-\frac{1}{bc}} \Big\}. \tag{54} \]
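The construction of the index sets can be sanity-checked numerically. The sketch below builds $\mathcal{D}_\epsilon = \{ n : N' < n \le f\epsilon^{-1/(bc)} \}$ with $f = R'^{2/(bc)} C'^{1/b}$ and $\bar\epsilon = (f/(2N'+34))^{bc}$, so that one can verify $t_n^c \ge \epsilon R'^{-2}$ on $\mathcal{D}_\epsilon$ and $|\mathcal{D}_\epsilon| \ge \max\{17, \frac{f}{2}\epsilon^{-1/(bc)}\}$. It assumes, for illustration only, that $N'$ is an integer and that the eigenvalues sit exactly on the lower bound, $t_n = C' n^{-b}$; the function name and the parameter values are hypothetical.

```python
import math

def packing_index_set(eps, b, c, R_prime, C_prime, N_prime):
    # f and eps_bar as set just before (52).
    f = R_prime ** (2.0 / (b * c)) * C_prime ** (1.0 / b)
    eps_bar = (f / (2 * N_prime + 34)) ** (b * c)
    if not 0 < eps <= eps_bar:
        raise ValueError("eps must lie in (0, eps_bar]")
    upper = f * eps ** (-1.0 / (b * c))
    # D_eps = { n in N : N' < n <= f * eps^(-1/(bc)) }
    D = list(range(N_prime + 1, math.floor(upper) + 1))
    return f, eps_bar, D

if __name__ == "__main__":
    b, c, R_prime, C_prime, N_prime = 2.0, 1.5, 0.25, 1.0, 3
    f = R_prime ** (2.0 / (b * c)) * C_prime ** (1.0 / b)
    eps = 0.5 * (f / (2 * N_prime + 34)) ** (b * c)
    f, eps_bar, D = packing_index_set(eps, b, c, R_prime, C_prime, N_prime)
    print("|D_eps| =", len(D), " lower bound =",
          max(17, 0.5 * f * eps ** (-1.0 / (b * c))))
```

The offset 34 in $\bar\epsilon$ is what guarantees at least 17 indices, the minimum cube dimension required by Lemma 10.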


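From the family of inequalities (53) and equation (50) it follows

\[ \mathcal{F}(a,R') \supset \mathcal{C}_\epsilon := \Big\{ \sum_{n \in \mathcal{D}_\epsilon} \sigma_n \Big( \frac{\epsilon}{2\,|\mathcal{D}_\epsilon|\, t_n} \Big)^{1/2} e_n \ \Big|\ (\sigma_n)_{n \in \mathcal{D}_\epsilon} \in \{-1,+1\}^{|\mathcal{D}_\epsilon|} \Big\}. \]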


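Since by inequality (54) $|\mathcal{D}_\epsilon| \ge 17$, we can apply Lemma 10 to the binary cube $\{-1,+1\}^{|\mathcal{D}_\epsilon|}$. This ensures that a subset $L_\epsilon$ of the cube exists with cardinality $|L_\epsilon| \ge B^{|\mathcal{D}_\epsilon|}$ for some constant $B$, and such that for all distinct $\sigma$ and $\sigma'$ in $L_\epsilon$ it holds

\[ d_H(\sigma,\sigma') \ge \frac{|\mathcal{D}_\epsilon|}{2}. \tag{55} \]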


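We are finally ready to define the set of functions $\mathcal{L}_\epsilon$ to which Theorem 8 will be applied. In fact we have the following straightforward chain of inclusions

\[ \mathcal{F}(a,R') \supset \mathcal{C}_\epsilon \supset \mathcal{L}_\epsilon := \Big\{ \sum_{n \in \mathcal{D}_\epsilon} \sigma_n \Big( \frac{\epsilon}{2\,|\mathcal{D}_\epsilon|\, t_n} \Big)^{1/2} e_n \ \Big|\ (\sigma_n)_{n \in \mathcal{D}_\epsilon} \in L_\epsilon \Big\}. \]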


Let us first notice that since $\|f\|_\rho = \|T^{1/2} f\|_{\mathcal H}$, by inequalities (55) it follows that for all distinct $g$ and $g'$ in $\mathcal{L}_\epsilon$ it holds

\[ \frac{\epsilon}{2} \le \|g - g'\|_\rho^2 \le 2\epsilon. \]

Then letting $c_1^2 = 2$ and $c_0^2 = 1/2$ in the definition of tight packing numbers for the class $\mathcal{L}_\epsilon$, it is clear that $\mathcal{L}_\epsilon$ itself is a maximal $\sqrt{\epsilon}$-net of functions. Moreover from (49) $\|g\|_{C(X)} \le 1/4$ for every $g \in \mathcal{L}_\epsilon$, so we can apply Theorem 8, obtaining for all $\ell \in \mathbb{N}$


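\[ \inf_{f_\cdot}\ \sup_{\rho \in \mathcal{M}(\mathcal{F},\rho_X)} P_{\mathbf z \sim \rho^\ell}\Big[ I[f_{\mathbf z}] > I[f_\rho] + \frac{\epsilon}{8} \Big] \ \ge\ \inf_{f_\cdot}\ \sup_{\rho \in \mathcal{M}(\mathcal{L},\rho_X)} P_{\mathbf z \sim \rho^\ell}\Big[ I[f_{\mathbf z}] > I[f_\rho] + \frac{\epsilon}{8} \Big] \ \ge\ \min\Big( \frac{1}{2},\ \eta \sqrt{\mathcal{N}(\cdot)}\, \exp(-32\ell\epsilon) \Big), \]

where $\mathcal{F} := \mathcal{F}(a,R)$, $\mathcal{L} := \mathcal{L}_\epsilon$ and by inequality (54) it holds

\[ \mathcal{N}(\cdot) := \mathcal{N}\big(\mathcal{L}_\epsilon, \rho_X, \sqrt{\epsilon}, \sqrt{1/2}, \sqrt{2}\big) = |L_\epsilon| \ge B^{|\mathcal{D}_\epsilon|} \ge B^{\frac{f}{2}\epsilon^{-\frac{1}{bc}}}. \]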


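The result claimed by the theorem follows by substituting $\epsilon$ in the inequalities above with the expression

\[ \epsilon(\ell) = 8C\,\ell^{-\frac{bc}{bc+1}}, \]

where

\[ C := \frac{1}{8}\Big( \frac{f}{128}\log B \Big)^{\frac{bc}{bc+1}} \]

and

\[ \ell \ge \bar\ell := \Big\lceil \Big( \frac{\bar\epsilon}{8C} \Big)^{-\frac{bc+1}{bc}} \Big\rceil, \]

in order to enforce the constraint $\epsilon \le \bar\epsilon$ and the condition

\[ \sqrt{\mathcal{N}(\cdot)}\, \exp(-32\ell\epsilon(\ell)) \ge 1. \]

Under this condition the minimum in the lower bound above is at least $\min(1/2, \eta) = \eta$.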
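The choice of constants in this final step can be verified numerically. The sketch below assumes that the condition to enforce is $\sqrt{\mathcal{N}(\cdot)}\exp(-32\ell\epsilon(\ell)) \ge 1$ with $\mathcal{N}(\cdot) \ge B^{\frac{f}{2}\epsilon^{-1/(bc)}}$ and $B = \exp(1/24)$, and works with logarithms to avoid overflow. The function names and the sample parameter values are hypothetical.

```python
import math

LOG_B = 1.0 / 24.0  # log of B = exp(1/24) from Lemma 10

def lower_bound_constants(f, eps_bar, b, c):
    # C and ell_bar as in the final step of the proof.
    p = b * c / (b * c + 1.0)
    C = (f * LOG_B / 128.0) ** p / 8.0
    ell_bar = math.ceil((eps_bar / (8.0 * C)) ** (-1.0 / p))
    return C, ell_bar

def eps_of_ell(ell, C, b, c):
    # eps(ell) = 8 C ell^(-bc/(bc+1))
    return 8.0 * C * ell ** (-b * c / (b * c + 1.0))

def log_condition(ell, f, C, b, c):
    # Logarithm of sqrt(B^{(f/2) eps^{-1/(bc)}}) * exp(-32 ell eps).
    # With the stated C the two terms balance, so the value is 0 up to
    # floating-point rounding.
    eps = eps_of_ell(ell, C, b, c)
    return (f / 4.0) * LOG_B * eps ** (-1.0 / (b * c)) - 32.0 * ell * eps

if __name__ == "__main__":
    b, c = 2.0, 1.5
    f = 0.25 ** (2.0 / (b * c))
    eps_bar = (f / 40.0) ** (b * c)
    C, ell_bar = lower_bound_constants(f, eps_bar, b, c)
    print("C =", C, " ell_bar =", ell_bar)
    print("eps(ell_bar) <= eps_bar:", eps_of_ell(ell_bar, C, b, c) <= eps_bar)
```

The balance is exact by design: $C$ is the largest constant for which the exponential term does not overwhelm the packing-number term, which is why the resulting rate $\ell^{-bc/(bc+1)}$ matches the upper bound.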